[llm] support tensorwise fp8/int8 training #10612
base: develop
Conversation
Thanks for your contribution!
Codecov Report
❌ Your patch check has failed because the patch coverage (16.57%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

@@            Coverage Diff             @@
##           develop   #10612      +/-   ##
===========================================
- Coverage    46.91%   46.90%    -0.02%
===========================================
  Files          799      800        +1
  Lines       132460   132519       +59
===========================================
+ Hits         62148    62157        +9
- Misses       70312    70362       +50
@@ -478,8 +525,8 @@ def load_state_dict(
    scale_dict.update(res_scale_dict)

    if device == "cpu":
        for k in list(state_dict.keys()):
            with device_guard():

Comment: This avoids the overhead of repeatedly calling set_device.
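For illustration, a minimal self-contained sketch of the pattern this comment refers to: entering the device guard once around the whole key loop so the device is switched a single time. The `device_guard` below is a toy stand-in, not the repo's implementation.

```python
import contextlib
import paddle


@contextlib.contextmanager
def device_guard(device="cpu"):
    # Toy stand-in for the device_guard used in the diff: switch the device once,
    # restore the original device on exit.
    original = paddle.device.get_device()
    paddle.device.set_device(device)
    try:
        yield
    finally:
        paddle.device.set_device(original)


def load_state_dict_on_cpu(state_dict):
    # One guard around the whole loop instead of one per key, avoiding the
    # repeated set_device calls the comment mentions.
    with device_guard("cpu"):
        for k in list(state_dict.keys()):
            state_dict[k] = paddle.to_tensor(state_dict[k])
    return state_dict
```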
"weight_only_int4", | ||
"weight_only_int8", | ||
] | ||
elif isinstance(config.quantization_config.weight_quantize_algo, dict): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
weight_only_int8不支持不同TP分片共享同一个scale,暂不支持wint8权重灵活转化TP策略
Comment: post_quantize means the weight is first TP-split and then quantized (for wint4/wint8).
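As a sketch of what this ordering means in practice (the helper names and the `quantize_fn` returning `(quantized_weight, scale)` are illustrative assumptions, not the repo's API):

```python
import paddle


def shard_then_quantize(weight, tp_rank, tp_degree, quantize_fn):
    # post_quantize: split the full-precision weight across TP ranks first,
    # then quantize each shard with its own scale.
    shard = paddle.split(weight, tp_degree, axis=-1)[tp_rank]
    return quantize_fn(shard)


def quantize_then_shard(weight, tp_rank, tp_degree, quantize_fn):
    # The opposite ordering: quantize the full weight once with a single scale,
    # then split the quantized tensor. Since weight_only_int8 cannot share one
    # scale across shards, flexible TP re-sharding of wint8 is not supported yet.
    qweight, scale = quantize_fn(weight)
    return paddle.split(qweight, tp_degree, axis=-1)[tp_rank], scale
```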
@@ -2537,6 +2615,7 @@ def from_pretrained(cls, pretrained_model_name_or_path, *args, **kwargs):
    # load pt weights early so that we know which dtype to init the model under
    if not is_sharded and state_dict is None:
        # 4. loading non-sharded ckpt from the state dict
        # Quantization: Loading non-sharded ckpt does not support saving with merge_tensor_parallel

Comment: Quantized loading and saving of non-safetensors weights is not considered for now.
    return block


def create_hadamard_matrix(block_size, dtype):

Comment: What is the difference between this and the earlier random_hadamard_matrix?
    if getattr(infohub, "hadamard") is None:
        setattr(infohub, "hadamard", {})

    if block_size in infohub.hadamard:

Comment: If hadamard_matrix has no default value, things will break when this branch is not taken.
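A minimal sketch of a deterministic (Sylvester-construction) Hadamard builder with a cache and an explicit fallback, so the cache-miss path always produces a matrix; unlike random_hadamard_matrix, the result depends only on block_size. The cache dict and function name are placeholders, not the repo's infohub API.

```python
import paddle

_HADAMARD_CACHE = {}  # stand-in for infohub.hadamard


def create_hadamard_matrix_sketch(block_size, dtype="bfloat16"):
    # Sylvester construction requires block_size to be a power of two.
    assert block_size > 0 and block_size & (block_size - 1) == 0
    if block_size in _HADAMARD_CACHE:
        return _HADAMARD_CACHE[block_size]
    # Cache miss: build the matrix instead of failing, then memoize it.
    h = paddle.ones([1, 1], dtype="float32")
    while h.shape[0] < block_size:
        h = paddle.concat(
            [paddle.concat([h, h], axis=1), paddle.concat([h, -h], axis=1)], axis=0
        )
    # Normalize so that H @ H.T == I, keeping the transform orthogonal.
    h = (h / float(block_size) ** 0.5).astype(dtype)
    _HADAMARD_CACHE[block_size] = h
    return h
```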
@@ -2107,16 +2109,6 @@ def get_optimizer_cls_and_kwargs(args: TrainingArguments) -> Tuple[Any, Any]:

        optimizer_cls = AdamWCustom
        optimizer_kwargs.update(adam_kwargs)
    elif args.optim == OptimizerNames.ADAMW_16BIT_MOMENT:

Comment: Why do these two AdamW implementations need to be removed?
@@ -318,8 +318,6 @@ class OptimizerNames(ExplicitEnum):
    ADAFACTOR = "adafactor"
    ADAMW_MINI = "adamw_mini"
    ADAMW_CUSTOM = "adamw_custom"
    ADAMW_16BIT_MOMENT = "adamw_16bit_moment"

Comment: Weren't these two used in any existing configurations?
@@ -868,6 +868,10 @@ class TrainingArguments:
        default="adamw",
        metadata={"help": "The optimizer to use."},
    )
    use_lowprecision_moment: bool = field(
        default=False,
        metadata={"help": "AdamW use lowbit moment as parameter."},

Comment: How many bits does "lowbit" refer to here? When is enabling it recommended, and what are the effects of turning it on? This needs a clear explanation.
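To make the question concrete, here is a rough sketch (bias correction omitted, names illustrative) of what an AdamW step with low-precision moments could look like if "lowbit" means bfloat16 moments. This illustrates the general idea behind use_lowprecision_moment, not the PR's Triton kernel; the trade-off is less precise moment accumulation in exchange for roughly half the optimizer-state memory.

```python
import paddle


def adamw_step_bf16_moments(param, grad, m_bf16, v_bf16, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    # Moments are kept in bfloat16 to halve optimizer-state memory; they are
    # cast to float32 only for the update arithmetic.
    m = beta1 * m_bf16.astype("float32") + (1.0 - beta1) * grad
    v = beta2 * v_bf16.astype("float32") + (1.0 - beta2) * grad * grad
    param = param - lr * m / (paddle.sqrt(v) + eps)
    # Store the moments back in bfloat16 for the next step.
    return param, m.astype("bfloat16"), v.astype("bfloat16")
```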
@@ -996,6 +1000,10 @@ class TrainingArguments:
        default=False,
        metadata={"help": "Offload optimizer after optimizer.step()"},
    )
    tensorwise_offload_optimizer: Optional[bool] = field(

Comment: The help message does not explain this clearly; why is this option needed?
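As one reading of what the flag is for (names and structure illustrative, not the PR's implementation): optimizer state tensors are moved to pinned CPU memory after optimizer.step() and copied back tensor by tensor before the next step, trading host-device transfers for GPU memory.

```python
import paddle


def offload_optimizer_states(optimizer):
    # Move every optimizer state tensor to pinned CPU memory after step().
    offloaded = {}
    for name, value in optimizer.state_dict().items():
        if isinstance(value, paddle.Tensor) and "gpu" in str(value.place):
            offloaded[name] = value.pin_memory()
        else:
            offloaded[name] = value
    return offloaded


def reload_optimizer_states(offloaded):
    # Copy the states back to the accelerator before the next optimizer.step().
    return {
        name: value.cuda() if isinstance(value, paddle.Tensor) else value
        for name, value in offloaded.items()
    }
```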
@@ -445,7 +452,9 @@ def compute_metrics_do_generation(eval_preds):
        gen_args=gen_args,
        data_args=data_args,
    )
    trainable_parameters = [p for p in model.parameters() if not p.stop_gradient]
    trainable_parameters = [
        p for p in model.parameters() if not p.stop_gradient or ("quantization_linear" in p.name and "w_1" in p.name)
    ]

Comment: Can this hardcoded name check be avoided? If not, how do we guarantee it always takes effect? At a minimum there should be a log message.
    if self.weight_quantize_algo not in ["fp8linear", "a8w4linear"]:
        self.quant_scale.is_distributed = False
    else:
        self.quant_scale.is_distributed = True if self.is_mp else False

Comment: Does DP need to be considered here as well?
    scale = paddle.max(paddle.abs(target_x)) / qmax
    if group is not None:
        paddle.distributed.all_reduce(scale, op=paddle.distributed.ReduceOp.MAX, group=group, sync_op=True)
    if state < quantization_config.apply_online_actscale_step:

Comment: Is the online scaling here the counterpart of delayed scaling? It is not clear what this parameter affects; it would help to explain it to users.
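For reference, a small self-contained sketch of the tensorwise abs-max scale with the MAX all-reduce shown in the diff above, which is what keeps the scale identical across TP/DP shards. The warmup gating via apply_online_actscale_step is deliberately left out, since its exact semantics are what this comment asks about.

```python
import paddle
import paddle.distributed as dist


def tensorwise_act_scale(x, qmax, group=None):
    # Per-tensor abs-max scale for the activation.
    scale = paddle.max(paddle.abs(x)) / qmax
    if group is not None:
        # MAX-reduce across the parallel group so every shard quantizes with
        # the same scale, which is what makes re-sharding safe.
        dist.all_reduce(scale, op=dist.ReduceOp.MAX, group=group, sync_op=True)
    return scale
```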
PR types
New features
PR changes
APIs
Description
Newly supported features:
1. Add all_reduce_max over weight scales and activation scales, so checkpoints can be re-sharded across different TP and data-parallel strategies.
2. Support FP8/INT8 training with DP + TP + PP + Sharding stage1, storing weights and optimizer states with Unified Checkpoint.
3. Switch the Hadamard matrix multiplication to a block-diagonal Hadamard matrix.
4. Unify the FP8/INT8 training code paths.
5. Add a Triton FP8-weight AdamW optimizer (with bf16 moments and offload support).
6. Support FP8/INT8 LoRA on the backbone model.
Features planned for follow-up PRs:
1. FP8 weights are currently represented as paddle.int8 and stored as np.int8; switch to a float8 representation later (pending framework support for fp8 set_value and concat).
2. Speed up the FP8/INT8 quant-matmul-dequant path and adapt the acceleration to MoE structures.
3. Support Sharding stage2/3 for FP8/INT8 training (PP only supports stage1; low priority).